Abstract

A general, practical method for hand- 
ling sparse data that avoids held-out 
data and iterative reestimation is derived 
from first principles. It has been tested 
on a part-of-speech tagging task and out- 
performed (deleted) interpolation with 
context-independent weights, even when 
the latter used a globally optimal para- 
meter setting determined a posteriori. 

1 Introduction 

Sparse data is a perennial problem when applying 
statistical techniques to natural language proces- 
sing. The fundamental problem is that there is of- 
ten not enough data to estimate the required sta- 
tistical parameters, i.e., the probabilities, directly 
from the relative frequencies. This problem is ac- 
centuated by the fact that in the search for more 
accurate probabilistic language models, more and 
more contextual information is added, resulting in 
more and more complex conditionings of the cor- 
responding conditional probabilities. This in turn 
means that the number of observations tends to 
be quite small for such contexts. Over the years, 
a number of techniques have been proposed to 
handle this problem. 

One of two different main ideas behind these 
techniques is that complex contexts can be gene- 
ralized, and data from more general contexts can 
be used to improve the probability estimates for 
more specific contexts. This idea is usually re- 
ferred to as back-off smoothing, see (Katz 1987). 
These techniques typically require that a sepa- 
rate portion of the training data be held out from 
the parameter-estimation phase and saved for de- 
termining appropriate back-off weights. Further- 
more, determining the back-off weights usually re- 
quires resorting to a time-consuming iterative ree- 
stimation procedure. A typical example of such a 
technique is "deleted interpolation", which is described in Section 5.1 below.

The other main idea is concerned with improving the estimates of low-frequency, or no-frequency, outcomes apparently without trying to
generalize the conditionings. Instead, these tech- 
niques are based on considerations of how popu- 
lation frequencies in general tend to behave. Ex- 
amples of this are expected likelihood estimation 
(ELE), see Section 5.2 below, and Good-Turing
estimation, see (Good 1953).

We will here derive from first principles a practi- 
cal method for handling sparse data that does not 
need separate training data for determining the 
back-off weights and which lends itself to direct 
calculation, thus avoiding time-consuming reesti- 
mation procedures. 

2 Linear Successive Abstraction 

Assume that we want to estimate the conditional probability $P(x \mid C)$ of the outcome $x$ given a context $C$ from the number of times $N_x$ it occurs in $N = |C|$ trials, but that this data is sparse. Assume further that there is abundant data in a more general context $C' \supset C$ that we want to use to get a better estimate of $P(x \mid C)$. The idea is to let the probability estimate $P(x \mid C)$ in context $C$ be a function $g$ of the relative frequency $f(x \mid C)$ of the outcome $x$ in context $C$ and the probability estimate $P(x \mid C')$ in context $C'$:

$$P(x \mid C) = g(f(x \mid C), P(x \mid C'))$$

Let us generalize this scenario slightly to the situation where we have a sequence of increasingly more general contexts $C_m \subset C_{m-1} \subset \ldots \subset C_1$, i.e., where there is a linear order of the various contexts $C_k$. We can then build the estimate of $P(x \mid C_k)$ on the relative frequency $f(x \mid C_k)$ in context $C_k$ and the previously established estimate of $P(x \mid C_{k-1})$. We call this method linear successive abstraction. A simple example is estimating the probability $P(x \mid l_{n-j+1}, \ldots, l_n)$ of word class $x$ given $l_{n-j+1}, \ldots, l_n$, the last $j$ letters of a word $l_1, \ldots, l_n$. In this case, the estimate will be based on the relative frequencies $f(x \mid l_{n-j+1}, \ldots, l_n), \ldots, f(x \mid l_n), f(x)$.

We will here consider the special case when the function $g$ is a weighted sum of the relative frequency and the previous estimate, appropriately renormalized:

$$P(x \mid C_k) = \frac{f(x \mid C_k) + \theta P(x \mid C_{k-1})}{1 + \theta}$$

We want the weight $\theta$ to depend on the context $C_k$, and in particular be proportional to some measure of how spread out the relative frequencies of the various outcomes in context $C_k$ are from the statistical mean. The variance is the quadratic moment w.r.t. the mean, and is thus such a measure. However, we want the weight to have the same dimension as the statistical mean, and the dimension of the variance is obviously the square of the dimension of the mean. The square root of the variance, which is the standard deviation, should thus be a suitable quantity. For this reason we will use the standard deviation in $C_k$ as a weight, i.e., $\theta = \sigma(C_k)$. One could of course multiply this quantity by any reasonable real constant, but we will arbitrarily set this constant to one, i.e., use $\sigma(C_k)$ itself.

In linguistic applications, the outcomes are 
usually not real numbers, but pieces of lingui- 
stic structure such as words, part-of-speech tags, 
grammar rules, bits of semantic tissue, etc. This 
means that it is not quite obvious what the stan- 
dard deviation, or the statistical mean for that 
matter, actually should be. To put it a bit more 
abstractly, we need to calculate the standard deviation of a non-numerical random variable.

2.1 Deriving the Standard Deviation 

So how do we find the standard deviation of a 
non-numerical random variable? One way is to 
construct an equivalent numerical random varia- 
ble and use the standard deviation of the latter. 
This can be done in several different ways. The 
one we will use is to construct a numerical random 
variable with a uniform distribution that has the 
same entropy as the non-numerical one. Whether 
we use a discrete or continuous random variable 
is, as we shall see, of no importance. 

We will first factor out the dependence on the context size. Quite in general, if $\bar{\xi}_N$ is the sample mean of $N$ independent observations of any numerical random variable $\xi$ with variance $\sigma_0^2$, i.e., $\bar{\xi}_N = \frac{1}{N} \sum_{i=1}^{N} \xi_i$, then

$$\sigma^2 = \mathrm{Var}[\bar{\xi}_N] = \frac{\sigma_0^2}{N}$$

In our case, the number of observations $N$ is simply the size of the context $C_k$, by which we mean the number of times $C_k$ occurred in the training data, i.e., the frequency count of $C_k$, which we will denote $|C_k|$. Since the standard deviation is the square root of the variance, we have

$$\sigma(C_k) = \frac{\sigma_0(C_k)}{\sqrt{|C_k|}}$$

Here $\sigma_0$ does not depend on the number of observations in context $C_k$, only on the underlying probability distribution conditional on context $C_k$.

To estimate $\sigma_0(C_k)$, we assume that we have either a discrete uniform distribution on $\{1, \ldots, M\}$ or a continuous uniform distribution on $[0, M]$ that is as hard to predict as the one in $C_k$ in the sense that the entropy is the same. The entropy $H[\xi]$ of a random variable $\xi$ is the expectation value of the logarithm of $1/P(\xi)$. In the discrete case we thus have

$$H[\xi] = E[-\ln P(\xi)] = \sum_i -P(x_i) \ln P(x_i)$$

Here $P(x_i)$ is the probability of the random variable $\xi$ taking the value $x_i$, which is $\frac{1}{M}$ for all possible outcomes $x_i$ and zero otherwise. Thus, the entropy is $\ln M$:

$$\sum_{i=1}^{M} -P(x_i) \ln P(x_i) = \sum_{i=1}^{M} \frac{1}{M} \ln M = \ln M$$

The continuous case is similar. We thus have that

$$\ln M = H[C_k] \quad \text{or} \quad M = e^{H[C_k]}$$

The variance of these uniform distributions is $M^2/12$ in the continuous case and $(M^2 - 1)/12$ in the discrete case. We thus have

$$\sigma_0(C_k) \approx \frac{M}{\sqrt{12}} = \frac{e^{H[C_k]}}{\sqrt{12}}$$

Unfortunately, the entropy $H[C_k]$ depends on the probability distribution of context $C_k$ and thus on $\sigma_0(C_k)$. Since we want to avoid trying to solve highly nonlinear equations, and since we have access to an estimate of the probability distribution of context $C_{k-1}$, we will make the following approximation:

$$\sigma(C_k) \approx \frac{\sigma_0(C_{k-1})}{\sqrt{|C_k|}} = \frac{e^{H[C_{k-1}]}}{\sqrt{12\,|C_k|}}$$

It is starting to look sensible to specify $\sigma^{-1}$ instead of $\sigma$, i.e., instead of $\frac{f + \sigma P}{1 + \sigma}$, we will write $\frac{\sigma^{-1} f + P}{\sigma^{-1} + 1}$.
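As a rough numerical illustration of the derivation in this section, the following Python sketch computes the entropy of a distribution, the entropy-matched quantity $e^{H}/\sqrt{12}$, and the resulting weight for a context of a given size. The function names (`entropy`, `sigma0`, `sigma`) and the dictionary representation of a distribution are assumptions made for this example only.

```python
import math

def entropy(dist):
    """Shannon entropy in nats of a dict mapping outcome -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

def sigma0(dist):
    """Standard deviation of a continuous uniform distribution on [0, M]
    with the same entropy as `dist`, i.e. e^H / sqrt(12)."""
    return math.exp(entropy(dist)) / math.sqrt(12.0)

def sigma(dist_general, context_count):
    """Approximate sigma(C_k) from the distribution estimated in the more
    general context C_{k-1} and the frequency count |C_k|."""
    return sigma0(dist_general) / math.sqrt(context_count)

# Example: a uniform distribution over M = 8 outcomes.
M = 8
uniform = {i: 1.0 / M for i in range(1, M + 1)}
exact_discrete = math.sqrt((M * M - 1) / 12.0)  # std of uniform on {1, ..., M}
print(sigma0(uniform), exact_discrete)
```

For the uniform example, `sigma0` gives $8/\sqrt{12}$ while the exact discrete value is $\sqrt{63/12}$; the two are close, which illustrates why the choice between the discrete and the continuous case matters little.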

2.2 The Final Recurrence Formula 

We have thus established a recurrence formula for the estimate of the probability distribution in context $C_k$ given the estimate of the probability distribution in context $C_{k-1}$ and the relative frequencies in context $C_k$:

$$P(x \mid C_k) = \frac{\sigma(C_k)^{-1} f(x \mid C_k) + P(x \mid C_{k-1})}{\sigma(C_k)^{-1} + 1} \qquad (1)$$

and

$$\sigma(C_k)^{-1} = \sqrt{12\,|C_k|}\; e^{-H[C_{k-1}]}$$



We will start by estimating the probability distribution in the most general context $C_1$, if necessary directly from the relative frequencies. Since this is the most general context, this will be the context with the most training data. Thus it stands the best chance of the relative frequencies being acceptably accurate estimates. This will allow us to calculate an estimate of the probability distribution in context $C_2$, which in turn will allow us to calculate an estimate of the probability distribution in context $C_3$, etc. We can thus calculate estimates of the probability distributions in all contexts $C_1, \ldots, C_m$.
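The recurrence of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `successive_abstraction`, the use of `Counter` objects for the outcome counts, and the convention of keeping the more general estimate when a context has no observations are all assumptions of the example. It also assumes that the counts are nested, so that every outcome seen in a specific context has been seen in the most general one.

```python
import math
from collections import Counter

def entropy(dist):
    """Shannon entropy in nats of a dict mapping outcome -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

def successive_abstraction(counts_by_context):
    """counts_by_context: list of Counters of outcomes, ordered from the
    most general context C_1 to the most specific context C_m.
    Returns the estimated distribution in C_m."""
    general = counts_by_context[0]
    total = sum(general.values())
    # C_1 is estimated directly from the relative frequencies.
    estimate = {x: n / total for x, n in general.items()}
    for counts in counts_by_context[1:]:
        size = sum(counts.values())  # |C_k|
        if size == 0:
            continue  # no observations: keep the more general estimate
        # sigma(C_k)^-1 = sqrt(12 |C_k|) * exp(-H[C_{k-1}])
        inv_sigma = math.sqrt(12.0 * size) * math.exp(-entropy(estimate))
        estimate = {
            x: (inv_sigma * counts.get(x, 0) / size + p) / (inv_sigma + 1.0)
            for x, p in estimate.items()
        }
    return estimate

# Example: plenty of data in C_1, a single observation in C_2.
c1 = Counter({"NN": 50, "VB": 30, "JJ": 20})
c2 = Counter({"NN": 1})
print(successive_abstraction([c1, c2]))
```

Because the weighted sum is renormalized by $\sigma(C_k)^{-1} + 1$, each intermediate estimate remains a probability distribution, and outcomes unseen in the specific context keep a nonzero share inherited from the more general one.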

We will next consider some examples from part- 
of-speech tagging. 

3 Examples from PoS Tagging 

Part-of-speech (PoS) tagging consists in assigning 
to each word of an input text a (set of) tag(s) from 
a finite set of possible tags, a tag palette or a tag 
set. The reason that this is a research issue is that 
a word can in general be assigned different tags 
depending on context. In statistical tagging, the 
relevant information is extracted from a training 
text and fitted into a statistical language model, 
which is then used to assign the most likely tag to 
each word in the input text. 

The statistical language model usually consists of lexical probabilities, which determine the probability of a particular tag conditional on the particular word, and contextual probabilities, which determine the probability of a particular tag conditional on the surrounding tags. The latter conditioning is usually on the tags of the neighbouring words, and very often on the $N - 1$ previous tags, so-called (tag) N-gram statistics. These probabilities can be estimated either from a pretagged training corpus or from untagged text, a lexicon and an initial bias. We will here consider the former case.

Statistical taggers usually work as follows: First, each word in the input word string $W_1, \ldots, W_n$ is assigned all possible tags according to the lexicon, thereby creating a lattice. A dynamic programming technique is then used to find the tag sequence $T_1, \ldots, T_n$ that maximizes

$$
\begin{aligned}
P(T_1, \ldots, T_n \mid W_1, \ldots, W_n)
&= \prod_{k=1}^{n} P(T_k \mid T_1, \ldots, T_{k-1}; W_1, \ldots, W_n) \\
&\approx \prod_{k=1}^{n} P(T_k \mid T_{k-N+1}, \ldots, T_{k-1}; W_k) \\
&\approx \prod_{k=1}^{n} \frac{P(T_k \mid T_{k-N+1}, \ldots, T_{k-1}) \cdot P(T_k \mid W_k)}{P(T_k)} \\
&= \prod_{k=1}^{n} \frac{P(T_k \mid T_{k-N+1}, \ldots, T_{k-1}) \cdot P(W_k \mid T_k)}{P(W_k)}
\end{aligned}
$$

Since the maximum does not depend on the factors $P(W_k)$, these can be omitted, yielding the standard statistical PoS tagging task:

$$\max_{T_1, \ldots, T_n} \prod_{k=1}^{n} P(T_k \mid T_{k-N+1}, \ldots, T_{k-1}) \cdot P(W_k \mid T_k)$$

This is well described in, for example, (DeRose 1988).

We thus have to estimate the two following sets of probabilities:

• Lexical probabilities:

The probability of each tag $T^i$ conditional on the word $W$ that is to be tagged, $P(T^i \mid W)$. Often the converse probabilities $P(W \mid T^i)$ are given instead, but we will for reasons soon to become apparent use the former formulation.

• Tag N-grams:

The probability of tag $T^i$ at position $k$ in the input string, denoted $T_k^i$, given that tags $T_{k-N+1}, \ldots, T_{k-1}$ have been assigned to the previous $N - 1$ words. Often $N$ is set to two or three, and thus bigrams or trigrams are employed. When using trigram statistics, this quantity is $P(T_k^i \mid T_{k-2}, T_{k-1})$.

3.1 N-gram Back-off Smoothing

We will first consider estimating the N-gram probabilities $P(T_k^i \mid T_{k-N+1}, \ldots, T_{k-1})$. Here, there is an obvious sequence of generalizations of the context $T_{k-N+1}, \ldots, T_{k-1}$ with a linear order, namely $T_{k-N+1}, \ldots, T_{k-1} \subset T_{k-N+2}, \ldots, T_{k-1} \subset \ldots \subset T_{k-1} \subset \Omega$, where $\Omega$ means "no information", corresponding to the unigram probabilities. Thus we will repeatedly strip off the tag furthest from the current word and use the estimate of the probability distribution in this generalized context to improve the estimate in the current context. This means that when estimating the $(j+1)$-gram probabilities, we back off to the estimate of the $j$-gram probabilities.

So when estimating $P(T_k^i \mid T_{k-j}, \ldots, T_{k-1})$, we simply strip off the tag $T_{k-j}$ and apply Eq. (1):

$$P(T_k \mid T_{k-j}, \ldots, T_{k-1}) = \frac{\sigma^{-1} f(T_k \mid T_{k-j}, \ldots, T_{k-1}) + P(T_k \mid T_{k-j+1}, \ldots, T_{k-1})}{\sigma^{-1} + 1}$$

and

$$\sigma^{-1} = \sqrt{12}\, \sqrt{|T_{k-j}, \ldots, T_{k-1}|}\; e^{-H[T_{k-j+1}, \ldots, T_{k-1}]}$$
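The N-gram back-off scheme of Section 3.1 can be sketched as a short recursive Python function. The names `train` and `estimate`, the tuple representation of tag contexts, and the omission of sentence-boundary symbols are simplifying assumptions of this example; a real tagger would handle boundaries and cache the intermediate estimates.

```python
import math
from collections import Counter, defaultdict

def entropy(dist):
    """Shannon entropy in nats of a dict mapping outcome -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

def train(tag_sequences, n=3):
    """Count each tag under the contexts formed by its 0..n-1 preceding tags."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for k, tag in enumerate(tags):
            for j in range(n):
                if k - j >= 0:
                    counts[tuple(tags[k - j:k])][tag] += 1
    return counts

def estimate(counts, context):
    """Smoothed P(tag | context), where context is a tuple of preceding tags."""
    freq = counts.get(context, Counter())
    size = sum(freq.values())
    if not context:
        # Unigram level ("no information"): plain relative frequencies.
        return {t: c / size for t, c in freq.items()}
    # Strip off the tag furthest from the current word and back off.
    backoff = estimate(counts, context[1:])
    if size == 0:
        return backoff
    inv_sigma = math.sqrt(12.0 * size) * math.exp(-entropy(backoff))
    return {t: (inv_sigma * freq[t] / size + p) / (inv_sigma + 1.0)
            for t, p in backoff.items()}

corpus = [["DT", "NN", "VB"], ["DT", "JJ", "NN"], ["DT", "NN", "NN"]]
counts = train(corpus)
print(estimate(counts, ("DT", "NN")))
```

Contexts that never occurred in the training data simply inherit the estimate of the next more general context, so every trigram context receives a proper distribution over all tags seen in training.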



3.2 Handling Unknown Words 

We will next consider improving the probability 
estimates for unknown words, i.e., words that do 
not occur in the training corpus, and for which we 
therefore have no lexical probabilities. The same 
technique could actually be used for improving the 
estimates of the lexical probabilities of words that 



do occur in the training corpus. The basic idea is 
that there is a substantial amount of information 
in the word suffixes, especially for languages with 
a richer morphological structure than English. For 
this reason, we will estimate the probability distri- 
bution conditional on an unknown word from the 
statistical data available for words that end with 
the same sequence of letters. Assume that the word consists of the letters $l_1, \ldots, l_n$. We want to know the probabilities $P(T^i \mid l_1, \ldots, l_n)$ for the various tags $T^i$.¹ Since the word is unknown, this data is not available. However, if we look at the sequence of generalizations of "ending with the same last $j$ letters", here denoted $l_{n-j+1}, \ldots, l_n$, we realize that sooner or later, there will be observations available, in the worst case looking at the last zero letters, i.e., at the unigram probabilities.

So when estimating $P(T^i \mid l_{n-j+1}, \ldots, l_n)$, we simply omit the $j$th last letter $l_{n-j+1}$ and apply Eq. (1):

$$P(T^i \mid l_{n-j+1}, \ldots, l_n) = \frac{\sigma^{-1} f(T^i \mid l_{n-j+1}, \ldots, l_n) + P(T^i \mid l_{n-j+2}, \ldots, l_n)}{\sigma^{-1} + 1}$$

and

$$\sigma^{-1} = \sqrt{12}\, \sqrt{|l_{n-j+1}, \ldots, l_n|}\; e^{-H[l_{n-j+2}, \ldots, l_n]}$$



This data can be collected from the words in the 
training corpus with frequencies below some thres- 
hold, e.g., words that occur less than say ten 
times, and can be indexed in a tree on reversed 
suffixes for quick access. 
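A minimal Python sketch of the suffix-based estimate follows. It is an illustration under stated assumptions: a plain dictionary keyed by suffix stands in for the tree on reversed suffixes, the frequency threshold on training words is omitted, and the names `suffix_counts`, `tag_distribution` and the cap `max_len` on the suffix length are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(dist):
    """Shannon entropy in nats of a dict mapping outcome -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

def suffix_counts(tagged_words, max_len=5):
    """Tag counts indexed by word suffix; the empty suffix holds unigram counts."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for j in range(min(max_len, len(word)) + 1):
            counts[word[len(word) - j:]][tag] += 1
    return counts

def tag_distribution(counts, word, max_len=5):
    """Estimate P(tag | word) by applying Eq. (1) to ever longer suffixes."""
    total = sum(counts[""].values())
    estimate = {t: c / total for t, c in counts[""].items()}
    for j in range(1, min(max_len, len(word)) + 1):
        freq = counts.get(word[len(word) - j:])
        if not freq:
            break  # this suffix is unseen, so all longer ones are too
        size = sum(freq.values())
        inv_sigma = math.sqrt(12.0 * size) * math.exp(-entropy(estimate))
        estimate = {t: (inv_sigma * freq[t] / size + p) / (inv_sigma + 1.0)
                    for t, p in estimate.items()}
    return estimate

data = [("walking", "VBG"), ("talking", "VBG"), ("king", "NN"),
        ("table", "NN"), ("red", "JJ")]
print(tag_distribution(suffix_counts(data), "jogging"))
```

The loop starts from the unigram distribution (the last zero letters) and refines it one letter at a time, stopping at the longest suffix for which observations exist.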

4 Partial Successive Abstraction 

If there is only a partial order of the various generalizations, the scheme is still viable. For example, consider generalizing symmetric trigram statistics, i.e., statistics of the form $P(T \mid T_l, T_r)$. Here, both $T_l$, the tag of the word to the left, and $T_r$, the tag of the word to the right, are one-step generalizations of the context $T_l, T_r$, and both have in turn the common generalization $\Omega$ ("no information"). We modify Eq. (1) accordingly:

$$P(T \mid T_l, T_r) = \frac{\sigma(T_l, T_r)^{-1} f(T \mid T_l, T_r) + \frac{1}{2}\left[P(T \mid T_l) + P(T \mid T_r)\right]}{\sigma(T_l, T_r)^{-1} + 1}$$

and

$$P(T \mid T_l) = \frac{\sigma(T_l)^{-1} f(T \mid T_l) + P(T)}{\sigma(T_l)^{-1} + 1}$$

$$P(T \mid T_r) = \frac{\sigma(T_r)^{-1} f(T \mid T_r) + P(T)}{\sigma(T_r)^{-1} + 1}$$



¹ Or really, $P(T^i \mid l_0, l_1, \ldots, l_n)$ where $l_0$ is a special symbol indicating the beginning of the word.



We call this partial successive abstraction. Since 
we really want to estimate a in the more specific 
context, and since the standard deviation (with 
the dependence on context size factored out) will 
most likely not increase when we specialize the 
context, we will use: 



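$$\sigma(T_l, T_r)^{-1} = \frac{\sqrt{|T_l, T_r|}}{\min(\sigma_0(T_l), \sigma_0(T_r))}$$

In the general case, where we have $M$ one-step generalizations $C'_i$ of $C$, we arrive at the equation

$$P(x \mid C) = \frac{\sigma(C)^{-1} f(x \mid C) + \frac{1}{M} \sum_{i=1}^{M} P(x \mid C'_i)}{\sigma(C)^{-1} + 1}$$

$$\sigma(C)^{-1} = \sqrt{|C|}\, \max_{i \in \{1, \ldots, M\}} \sigma_0(C'_i)^{-1}, \qquad \sigma_0(C'_i)^{-1} = \sqrt{12}\; e^{-H[C'_i]}$$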



By calculating the estimates of the probability 
distributions in such an order that whenever esti- 
mating the probability distribution in some parti- 
cular context, the probability distributions in all 
more general contexts have already been estima- 
ted, we can guarantee that all quantities necessary 
for the calculations are available. 
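The ordering requirement can be met with a memoized recursion, as in the following Python sketch for the symmetric trigram example. The names `count_triples`, `parents` and `make_estimator`, and the encoding of a stripped tag as `None`, are assumptions of this illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(dist):
    """Shannon entropy in nats of a dict mapping outcome -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

def count_triples(triples):
    """Count the middle tag in the full context and in each generalization."""
    counts = defaultdict(Counter)
    for left, tag, right in triples:
        for context in ((left, right), (left, None), (None, right), (None, None)):
            counts[context][tag] += 1
    return counts

def parents(context):
    """One-step generalizations of a (left, right) context; None means stripped."""
    left, right = context
    if left is not None and right is not None:
        return [(left, None), (None, right)]
    if left is not None or right is not None:
        return [(None, None)]
    return []  # the "no information" context has no generalization

def make_estimator(counts):
    cache = {}
    def estimate(context):
        if context not in cache:
            freq = counts.get(context, Counter())
            size = sum(freq.values())
            # More general contexts are estimated first, via the recursion.
            ups = [estimate(c) for c in parents(context)]
            if not ups:
                cache[context] = {x: n / size for x, n in freq.items()}
            else:
                mean = {x: sum(u[x] for u in ups) / len(ups) for x in ups[0]}
                inv_sigma = math.sqrt(12.0 * size) * max(
                    math.exp(-entropy(u)) for u in ups)
                cache[context] = {
                    x: (inv_sigma * (freq[x] / size if size else 0.0) + p)
                       / (inv_sigma + 1.0)
                    for x, p in mean.items()
                }
        return cache[context]
    return estimate

triples = [("DT", "NN", "VB"), ("DT", "JJ", "NN"),
           ("DT", "NN", "IN"), ("IN", "NN", "VB")]
estimate = make_estimator(count_triples(triples))
print(estimate(("DT", "VB")))
```

Taking the maximum of $\sqrt{12}\, e^{-H}$ over the generalizations corresponds to using the smallest $\sigma_0$ among them; the cache ensures that each context is estimated only once.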

5 Relationship to Other Methods 

We will next compare the proposed method to, 
in turn, deleted interpolation, expected likelihood 
estimation and Katz's back-off scheme. 

5.1 Deleted Interpolation 

Interpolation requires that the training corpus is 
divided into one part used to estimate the relative 
frequencies, and a separate held-back part used 
to cope with sparse data through back-off smoo- 
thing. For example, tag trigram probabilities can 
be estimated as follows: 

$$P(T_k \mid T_{k-2}, T_{k-1}) \approx \lambda_1 f(T_k) + \lambda_2 f(T_k \mid T_{k-1}) + \lambda_3 f(T_k \mid T_{k-2}, T_{k-1})$$

Since the probability estimate is a linear combination of the various observed relative frequencies, this is called linear interpolation. The weights $\lambda_j$ may depend on the conditionings, but are required to be nonnegative and to sum to one over $j$. An enhancement is to partition the training set into $n$ parts and in turn perform linear interpolation with each of the $n$ parts held out to determine the back-off weights and use the remaining $n - 1$ parts for parameter estimation. The various back-off weights are combined in the process. This is usually referred to as deleted interpolation.

The weights $\lambda_j$ are determined by maximizing the probability of the held-out part of the training data, see (Jelinek & Mercer 1980). A locally optimal weight setting can be found using Baum-Welch reestimation, see (Baum 1972). Baum-Welch reestimation is however prohibitively time-consuming for complex contexts if the weights are allowed to depend on the contexts, while successive abstraction is clearly tractable; the latter effectively determines these weights directly from the same data as the relative frequencies.
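To make the connection concrete, unrolling Eq. (1) shows that linear successive abstraction is itself a linear interpolation whose weights follow from the values of $\sigma^{-1}$. The Python sketch below is an illustration only; the function names `interpolate` and `implied_weights` and the list-based interface are assumptions of the example.

```python
def interpolate(lambdas, rel_freqs):
    """Linear interpolation: sum_j lambda_j * f_j, outcome by outcome.
    rel_freqs is a list of dicts mapping outcome -> relative frequency."""
    if any(l < 0.0 for l in lambdas) or abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to one")
    outcomes = set().union(*rel_freqs)
    return {x: sum(l * f.get(x, 0.0) for l, f in zip(lambdas, rel_freqs))
            for x in outcomes}

def implied_weights(inv_sigmas):
    """inv_sigmas: [sigma(C_2)^-1, ..., sigma(C_m)^-1], general to specific.
    Returns the weights [lambda_1, ..., lambda_m] that Eq. (1) effectively
    places on f(. | C_1), ..., f(. | C_m)."""
    weights = [1.0]  # C_1 is estimated by its relative frequencies alone
    for s in inv_sigmas:
        # Each step scales the earlier weights by 1/(s+1) and gives the
        # new, more specific relative frequency the weight s/(s+1).
        weights = [w / (s + 1.0) for w in weights] + [s / (s + 1.0)]
    return weights

weights = implied_weights([1.0, 3.0])
print(weights, sum(weights))
```

The implied weights are nonnegative and sum to one by construction, and they depend on the context through $|C_k|$ and the entropy of the more general estimate, without any held-out data or reestimation.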

5.2 Expected Likelihood Estimation 

Expected likelihood estimation (ELE) consists in assigning an extra half a count to all outcomes. Thus, an outcome that didn't occur in the training data receives half a count, an outcome that occurred once receives three half counts. This is equivalent to assigning a count of one to the occurring, and one third to the non-occurring outcomes. To give an indication of how successive abstraction is related to ELE, consider the following special case: If we indeed have a uniform distribution with $M$ outcomes of probability $\frac{1}{M}$ in context $C_{k-1}$ and there is but one observation of one single outcome in context $C_k$, then Eq. (1) will assign to this outcome the probability $\frac{\sqrt{12} + 1}{\sqrt{12} + M}$ and to the other, non-occurring, outcomes $\frac{1}{\sqrt{12} + M}$. So if we had used 2 instead of $\sqrt{12}$ in Eq. (1), this would have been equivalent to assigning a count of one to the outcome that occurred, and a count of one third to the ones that didn't. As it is, the latter outcomes are assigned a count of $\frac{1}{\sqrt{12} + 1}$.
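The special case can be checked numerically with a few lines of Python. The function name `special_case` and its parameter `c`, which stands in for the constant $\sqrt{12}$ of Eq. (1), are assumptions of this sketch.

```python
import math

def special_case(M, c=math.sqrt(12.0)):
    """Eq. (1) for a uniform distribution over M outcomes in C_{k-1} and a
    single observation in C_k; c plays the role of sqrt(12)."""
    # sigma^-1 = c * sqrt(|C_k|) * exp(-H[C_{k-1}]) with |C_k| = 1, H = ln M
    inv_sigma = c * math.sqrt(1.0) * math.exp(-math.log(M))
    seen = (inv_sigma * 1.0 + 1.0 / M) / (inv_sigma + 1.0)    # f = 1
    unseen = (inv_sigma * 0.0 + 1.0 / M) / (inv_sigma + 1.0)  # f = 0
    return seen, unseen

seen, unseen = special_case(10)
print(unseen / seen, 1.0 / (math.sqrt(12.0) + 1.0))  # ratio of implied counts

seen2, unseen2 = special_case(10, c=2.0)
print(unseen2 / seen2)  # ELE-like ratio when 2 replaces sqrt(12)
```

The ratio of the two probabilities is independent of $M$, which is why the comparison with ELE can be phrased purely in terms of counts.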

5.3 Katz's Back-Off Scheme 

The proposed method is identical to Katz's back-off method (Katz 1987) up to the point of suggesting a, in the general case non-linear, retreat to more general contexts:

$$P(x \mid C) = g(f(x \mid C), P(x \mid C'))$$

Blending the involved distributions $f(x \mid C)$ and $P(x \mid C')$, rather than only backing off to $C'$ if $f(x \mid C)$ is zero, and in particular, instantiating the function $g(f, P)$ to a weighted sum, distinguishes the two approaches.

6 Experiments 

A standard statistical trigram tagger has been im- 
plemented that uses linear successive abstraction 
for smoothing the trigram and bigram probabili- 
ties, as described in Section 3.1, and that handles 
unknown words using a reversed suffix tree, as de- 
scribed in Section 3.2, again using linear succes- 
sive abstraction to improve the probability esti- 
mates. This tagger was tested on the Susanne 
Corpus, (Sampson 1995), using a reduced tag set 
of 62 tags. The size of the training corpus A was 
almost 130,000 words. There were three separate 
test corpora B, C and D consisting of approxima- 
tely 10,000 words each. 
Figure 1: Results on the Susanne Corpus



The performance of the tagger was compared 
with that of an HMM-based trigram tagger that 
uses linear interpolation for N-gram smoothing,
but where the back-off weights do not depend on 
the conditionings. An optimal weight setting was 
determined for each test corpus individually, and 
used in the experiments. Incidentally, this setting 
varied considerably from corpus to corpus. Thus, 
this represented the best possible setting of back- 
off weights obtainable by linear interpolation, and 
in particular by linear deleted interpolation, when 
these are not allowed to depend on the context. 

In contrast, the successive abstraction scheme 
determined the back-off weights automatically 
from the training corpus alone, and the same 
weight setting was used for all test corpora, yiel- 
ding results that were at least on par with those 
obtained using linear interpolation with a globally 
optimal setting of context-independent back-off 
weights determined a posteriori. Both taggers 
handled unknown words by inspecting the suffi- 
xes, but the HMM-based tagger did not smooth 
the probability distributions. 

The experimental results are shown in Figure 1. 
Note that the absolute performance of the trigram 
tagger is around 96 % accuracy in two cases and 
distinctly above 95 % accuracy in all cases, which 
is clearly a state-of-the-art result. Since each test
corpus consisted of about 10,000 words, and the 
error rates are between 4 and 5 %, the 5 percent 
significance level for differences in error rate is bet- 
ween 0.39 and 0.43 % depending on the error rate, 
and the 10 percent significance level is between 
0.32 and 0.36 %. 
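The text does not state which test lies behind these thresholds. One reading that comes close to the quoted figures, offered here purely as an assumption, is a normal approximation to the binomial error rate with two-sided critical values of 1.96 and 1.645. The function name `threshold` is hypothetical.

```python
import math

def threshold(error_rate, n, z):
    """z standard errors of a binomial proportion, in percentage points.
    Assumption: normal approximation with standard error sqrt(p(1-p)/n)."""
    return 100.0 * z * math.sqrt(error_rate * (1.0 - error_rate) / n)

for level, z in (("5 percent", 1.96), ("10 percent", 1.645)):
    low = threshold(0.04, 10_000, z)
    high = threshold(0.05, 10_000, z)
    print(f"{level}: {low:.2f} to {high:.2f}")
```

Under this assumption the 10 percent thresholds come out at about 0.32 and 0.36, and the 5 percent thresholds at about 0.38 and 0.43, so the sketch approximately, but not exactly, reproduces the ranges given above.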



We see that the trigram tagger is better than 
the bigram tagger in all three cases and signifi- 
cantly better at significance level 10 percent, but 
not at 5 percent, in case C. So at this signifi- 
cance level, we can conclude that smoothed tri- 
gram statistics improve on bigram statistics alone. 
The trigram tagger performed better than the 
HMM-based one in all three cases, but not sig- 
nificantly better at any significance level below 
10 percent. This indicates that the successive 
abstraction scheme yields back-off weights that
are at least as good as the best ones obtainable 
through linear deleted interpolation with context- 
independent back-off weights. 

7 Summary and Further Directions 

In this paper, we derived a general, practical me- 
thod for handling sparse data from first principles 
that avoids held-out data and iterative reestima- 
tion. It was tested on a part-of-speech tagging 
task and outperformed linear interpolation with 
context-independent weights, even when the lat- 
ter used a globally optimal parameter setting de- 
termined a posteriori. 

Informal experiments indicate that it is possible to achieve slightly better performance by replacing the expression for $\sigma_0^{-1}(C_k)$ with a fixed global constant (while retaining the factor $\sqrt{|C_k|}$, which is most likely a quite accurate model of the dependence on context size). However, the optimal value for this parameter varied by more than an order of magnitude, and the improvements in performance were not very large. Furthermore, suboptimal choices of this parameter tended to degrade performance, rather than improve it. This indicates that the proposed formula is doing a pretty good job of approximating an optimal parameter choice. It would nonetheless be interesting to see if the formula could be improved on, especially seeing that it was theoretically derived, and then directly applied to the tagging task, immediately yielding the quoted results.